Add thunder benchmarks #3394

Open · wants to merge 10 commits into main from pm/add_thunder_bench

Conversation

@Priya2698 Priya2698 commented Nov 12, 2024

Adds Thunder as an additional executor to the baseline benchmarks, using the corresponding thunder.jit function.
The following benchmarks do not have a Thunder variant:

  1. instancenorm: Unsupported operator in Thunder
  2. test_gelu_backward_reduction.py: .backward call is not supported within Thunder definitions. @IvanYashchuk has suggested using explicit backward computation for this case.

Issue #2718
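
For context, a minimal sketch of the executor dispatch this PR relies on. The helper name with_executor comes from later in this thread; the exact signature and body here are illustrative, not the code in this PR:

import thunder
import torch

def with_executor(fwd_fn, executor: str):
    # Illustrative dispatch only; the real helper lives in the benchmark utilities.
    if executor == "eager":
        return fwd_fn
    if executor == "torchcompile":
        return torch.compile(fwd_fn)
    if executor == "thunder":
        return thunder.jit(fwd_fn)
    raise ValueError(f"Unknown executor: {executor}")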

@Priya2698 (Collaborator Author)

!build

@Priya2698 Priya2698 marked this pull request as ready for review November 12, 2024 17:11
@liqiangxl (Collaborator)

!test --pybench

@liqiangxl (Collaborator)

Can you check what the benchmark results look like? Make sure there are no unexpected results.

@xwang233 (Collaborator) commented Nov 14, 2024

!test --pybench-full --dev

See results here when the pipeline finishes.

@Priya2698 Priya2698 force-pushed the pm/add_thunder_bench branch from ce46a99 to ec697f0 Compare November 20, 2024 18:13
@Priya2698 Priya2698 marked this pull request as draft November 20, 2024 18:38
@Priya2698 Priya2698 force-pushed the pm/add_thunder_bench branch from ec697f0 to 2d9b5d3 Compare November 20, 2024 18:41
@Priya2698 Priya2698 force-pushed the pm/add_thunder_bench branch from 2d9b5d3 to 15a9f50 Compare December 9, 2024 20:46
@naoyam naoyam mentioned this pull request Dec 10, 2024
@Priya2698 (Collaborator Author)

!test --pybench-full --dev

@Priya2698 Priya2698 force-pushed the pm/add_thunder_bench branch from d69811f to ff1373f Compare December 10, 2024 19:20
@Priya2698 (Collaborator Author)

!test --pybench-full --dev

1 similar comment
@Priya2698 (Collaborator Author)

!test --pybench-full --dev

@@ -325,6 +327,9 @@ def run_benchmark(
     def setup():
         clear_l2_cache()
         if device == "cuda":
+            for inp in inputs:
+                if isinstance(inp, torch.Tensor):
+                    inp.grad = None
Collaborator:

Thank you for this one. But this only covers the cases where an input requires gradients. Are we also clearing gradients on parameters?

@Priya2698 (Collaborator Author) Dec 16, 2024

Do you mean, for instance, weights in layernorm? Then, yes.

Collaborator:

I'm curious how this works in code.

If benchmark_fn is not a function but a torch module, the Thunder program doesn't expect the parameters to be among its inputs; I think they are stored in the Thunder-compiled callable. So I'm not sure how that's handled.

i.e., something like this:

import torch

# Per the point above, the module's parameters presumably live inside the
# thunder-compiled callable, not in the `inputs` list handed to run_benchmark.
foo = torch.nn.Linear(4, 5).cuda()
inp = torch.randn(8, 4, device="cuda")
benchmark_fn = with_executor(foo, "thunder")
# ...
run_benchmark(benchmark, unary_bwd_torch, [output, grad], ...)

Collaborator Author:

Ahh, you're right.
Even with clearing the gradients of the weights, bias, and inputs in the backward pass, I think there are still some variables/internal states that need to be reset.
The simplest way would be to run only one round for backward, but I feel that may be noisy, so I have been trying to make it work for multiple runs.

Collaborator:

Gotcha. No worries. I'm not totally clear on what Thunder's protocol is for ownership of parameters; I think it's supposed to be a functional compilation.
So we can still expect that with_executor has a chance to extract parameters from the nn.Module if one is given for a benchmark, and we should be able to identify the parameters that need zero_grad, just like an optimizer would.
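
A minimal sketch of that idea, assuming the benchmark setup still has a handle on the original nn.Module (clear_grads and maybe_module are hypothetical names):

import torch

def clear_grads(maybe_module) -> None:
    # Clear parameter gradients the way optimizer.zero_grad(set_to_none=True)
    # does, so grads do not accumulate across benchmark rounds.
    if isinstance(maybe_module, torch.nn.Module):
        for p in maybe_module.parameters():
            p.grad = None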

Collaborator:

BTW, this could also contribute to a potential performance diff.

If there are parameters requiring grad, Thunder will generate a backward graph and save intermediates, regardless of whether backward is called or not.
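
A sketch of one way to avoid that for forward-only measurements (not code from this PR; the helper name is made up):

import torch

def freeze_params(module: torch.nn.Module) -> torch.nn.Module:
    # With no parameters requiring grad, a compiler (thunder.jit,
    # torch.compile, ...) has no reason to build a backward graph or keep
    # intermediates alive for it in a forward-only benchmark.
    for p in module.parameters():
        p.requires_grad_(False)
    return module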

@Priya2698 (Collaborator Author)

!test --pybench-full --dev

1 similar comment
@Priya2698 (Collaborator Author)

!test --pybench-full --dev

@Priya2698 Priya2698 marked this pull request as ready for review December 19, 2024 00:50
@Priya2698 (Collaborator Author)

For some operators, we still see performance differences between Thunder and the manual nvFuser definitions, which may be due to slight differences in the generated fusion definitions. I will look into this operator by operator.

I have compared the measured timings against nsys timings for some operators to verify the accuracy of the benchmark infra. My recommendation is to check in these changes adding the Thunder benchmarks while we investigate the performance gap.

@jjsjann123 (Collaborator) left a comment

I think there are still quite a few open questions about next steps and how to automate things like the IOBytes computation.

But for this PR, things look pretty mechanical, so I'm good with merging it as-is and ironing out the remaining issues afterwards.

My only concern is: would the disruption render our benchmarks unreliable until we fix all these issues, and would that cause problems for folks looking at the benchmarks?

@@ -115,6 +121,6 @@ def test_softmax_bwd_baseline_benchmark(
     run_benchmark(
         benchmark,
         unary_bwd_torch,
-        [outputs, grads],
+        [outputs, grads, *fwd_inputs],
Collaborator:

Sneaky! I see unary_bwd_torch discards the extra inputs, but you pass them here so their grads get cleared?
We could add a comment in unary_bwd_torch explaining why we don't assert on the number of inputs.
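
Something like this sketch, based on the unary_bwd_torch shown in the reviewer guide below (only the comment is new):

from typing import List

def unary_bwd_torch(inputs: List):  # [output, grad_out, *fwd_inputs]
    # Only the first two entries are used here. Any further entries
    # (e.g. fwd_inputs) are accepted solely so run_benchmark's setup() can
    # reset their .grad between rounds, hence no assertion on len(inputs).
    inputs[0].backward(inputs[1], retain_graph=True)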

@Priya2698 (Collaborator Author)

> My only concern is: would the disruption render our benchmarks unreliable until we fix all these issues, and would that cause problems for folks looking at the benchmarks?

The thunder.jit performance numbers match nsys. They also compare correctly against the manual nvfuser definitions for operators like scale-bias-relu, softmax, and silu-mul. But for others, like the normalizations, the performance is different. So my next step is to actually compare the fusion definitions and work out how each difference between the two fusion definitions corresponds to the performance gap (for instance, the manual fusion definitions are missing some downcasts for intermediate ops, the ordering of ops differs, etc.).

So by performance gap, I mean the latter case. The benchmark measurements themselves should be accurate.
This PR actually fixes one of the measurement issues that existed in the backward benchmarks due to grad accumulation.
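
As a toy, eager-mode illustration of what a missing intermediate downcast means for memory traffic (plain PyTorch, not the actual fusion definitions):

import torch

x = torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16)

# Variant A: the intermediate stays in fp32 until the end, so the scale
# kernel reads and writes fp32 tensors.
y_a = (x.float().softmax(dim=-1) * 2.0).to(torch.bfloat16)

# Variant B: downcast right after the softmax, so the scale kernel touches
# bf16 tensors instead, roughly half the memory traffic for that op.
y_b = x.float().softmax(dim=-1).to(torch.bfloat16) * 2.0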

@Priya2698 (Collaborator Author)

> I think there are still quite a few open questions about next steps and how to automate things like the IOBytes computation.

Agreed! Let me open an issue tracking the IOBytes case. I will create an issue for the performance differences between thunder.jit and the manual nvfuser definitions shortly (I want to do some initial analysis first).

@Priya2698 (Collaborator Author)

!build

@Priya2698 Priya2698 force-pushed the pm/add_thunder_bench branch from 84bb09c to b7e0b3e Compare January 16, 2025 02:29
github-actions bot commented Jan 16, 2025

PR Reviewer Guide 🔍

(Review updated until commit 083d89d)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
⚡ Recommended focus areas for review

Changed Function Signature

The unary_bwd_torch function signature has been changed to accept additional arguments. This change may affect the functionality of the benchmarks.

def unary_bwd_torch(inputs: List):  # [output, grad_out, fwd_inputs]
    inputs[0].backward(inputs[1], retain_graph=True)
Added New Executor

A new executor named "thunder" has been added to the DEFAULT_EXECUTORS list. This change may affect the behavior of the benchmarks.

DEFAULT_EXECUTORS = ["eager", "torchcompile", "thunder"]
Modified Benchmark Function

The setup step of run_benchmark has been modified to clear input gradients (in addition to the existing L2 cache clearing) before each round. This change may affect the performance measurements.

def setup():
    clear_l2_cache()
    if device == "cuda":
        for inp in inputs:
            if isinstance(inp, torch.Tensor):
                inp.grad = None
        return [inputs], {}

@Priya2698 (Collaborator Author)

!test --pybench-full --dev

@Priya2698 Priya2698 requested a review from jjsjann123 January 16, 2025 22:38
@Priya2698 (Collaborator Author) commented Jan 16, 2025

@jacobhinkle @liqiangxl pinging for review.

We continue to see performance differences between Thunder-nvFuser and nvFuser. The Thunder-nvFuser executor will not run by default, so we can continue investigating this in parallel. This PR has the fix for the bwd grad accumulation issue, so we can merge it. If the consensus is to hold off on adding the Thunder-nvFuser benchmarks altogether, I can remove it from the executors list for the benchmarks as well; that change is minimal.

Latest benchmark run: http://nv/euP

  1. nvFuser performance remains the same.
  2. Eager and torch.compile remain the same for fwd and are faster for bwd.
  3. Thunder-nvFuser and nvFuser performance differs in some cases, mostly in the bwd pass. See issue "Performance gap between manual nvfuser definition and thunder.jit" #3629 for an example.

@jjsjann123 Can you give me the numbers you expect for RoPE bwd after the grad accumulation issue is resolved? I can then verify the numbers with this PR.

@@ -23,6 +22,8 @@
L2_CACHE_SIZE = DEVICE_PROPERTIES["gpu_l2_bytes"]
PEAK_BANDWIDTH_GBPS = DEVICE_PROPERTIES["gpu_peak_bandwidth_gbps"]

DEFAULT_EXECUTORS = ["eager", "torchcompile", "thunder"]
Collaborator Author:

Maybe this should be named differently, since these are not run nightly; for most benchmarks, though, this is the set of executors we execute weekly. We also have thunder-torchcompile for RoPE.
Maybe BASELINE_EXECUTORS is better, although Thunder is not really a baseline.

@Priya2698 (Collaborator Author)

!build
